Games are used to train reinforcement learning (RL) agents because they stand in for
complex, high-dimensional real-world data. By using games, RL researchers can avoid the
high experimental cost of training an agent to perform intelligent tasks. The objective
of this research is to generate intelligent agent behaviors in multi-agent game artificial
intelligence (AI) using a deep reinforcement learning (DRL) algorithm. A basic RL
algorithm, the deep Q network, is implemented. The agent is trained on the environment's
raw pixel images and the action list information. The experiments conducted with this
algorithm show the agent's decision-making ability in choosing a favorable action. In the
default setting for the algorithm, training uses 1 epoch and a learning rate of 0.0025.
The number of training iterations is set to one because the training function is called
repeatedly, once every 4 timesteps. The author also experimented with two other scenarios
for training the agent and compared the results. The experimental findings demonstrate
that the agents learn correctly and successfully while actively participating in the game
in real time. Additionally, the agent can quickly adjust to a different enemy on a varied
map because of the knowledge observed in prior training.
1. INTRODUCTION

Technology is developed to help humans perform work that is either dangerous or computationally
complex. Inventions such as computers and smartphones are good examples of technological development
that enables humans to work in a smart, simple, and efficient manner through a variety of smart programs.
One example of a smart program is the Google Assistant, a virtual assistant developed by Google that
recognizes and processes voice input. Such a program has a trainable intelligence that gets better at
recognizing voice and processing tasks, provided it has a sufficient amount of input and training time.
This man-made intelligence is called artificial intelligence (AI).

Google DeepMind and OpenAI are companies that have shown the potential of AI in solving problems
that can be trained in a simulated environment. Reinforcement learning (RL), one of the machine learning
(ML) methods, is used by these companies to train expert agents that outperform humans in games. Games
are used to train RL agents because they stand in for complex, high-dimensional real-world data [1]-[9].
By using games, RL researchers can avoid the high experimental cost of training an agent to perform
intelligent tasks [10]. However, RL applications are still affected by high sample complexity, especially
in multi-agent systems. To address this problem, Loftin et al. [11] formally define the concept of
strategically effective exploration in Markov games and use it to create two learning algorithms for
finite Markov games.

The combination of deep learning and RL in an end-to-end framework, known as deep reinforcement
learning (DRL), significantly enhances the generalization and scalability of conventional RL algorithms
by training agents to make decisions in high-dimensional state spaces, such as playing video games,
controlling robots, and making decisions in various real-world applications. The studies in [12], [13]
examine the evolution of DRL research with a particular emphasis on AlphaGo and AlphaGo Zero. Li [14]
discusses key components, significant mechanisms, and a range of applications while providing an
overview of DRL achievements. Meanwhile, Perakam et al. [15] introduce the first deep-learning model
that can successfully learn control policies from high-dimensional sensory input. The model is a
convolutional neural network trained using a variation of Q-learning, with raw pixels as its input and
an estimate of future rewards as its output. More recently, RL algorithms were implemented in [15]-[22],
mostly with the aim of generating intelligent agent behavior. This research implements RL as the ML
algorithm to create an agent that could outperform humans in an Atari game, using the algorithm proposed
in [23]. The agent is trained on the environment's raw pixel images and the action list information.


2. RESEARCH METHOD 
2.1. The Markov property 

The state in RL should satisfy the Markov property. The Markov property states that the current
state completely characterizes the state of the world, so that, given the current state, the future is
independent of the past [24]. In the agent's environment, the agent transitions to another state through
the action taken. If the next state and reward can be predicted from the current state and action alone,
without knowledge of the preceding events, the property holds. It is expressed in (1).


$$\Pr\{S_{t+1} = s', R_{t+1} = r \mid S_t, A_t\} = \Pr\{S_{t+1} = s', R_{t+1} = r \mid S_t, A_t, R_t, S_{t-1}, A_{t-1}, \ldots, R_1, S_0, A_0\} \tag{1}$$


2.2. Markov decision models 

The Markov decision process (MDP) is the mathematical formulation of the RL problem, defined by
the tuple $(S, A, R, P, \gamma)$, where $S$ represents the set of possible states, $A$ represents the
set of possible actions, $R$ represents the distribution of the reward given a (state, action) pair,
$P$ represents the transition probability, i.e., the distribution of the next state given a
(state, action) pair, and $\gamma$ represents the discount factor.

The MDP describes the main task of the RL agent, which proceeds as follows: i) the agent is
initialized by sampling the environment's initial state $s_0 \sim p(s_0)$; and ii) then, from $t = 0$
until done, the agent selects an action $a_t$, the environment samples a reward
$r_t \sim R(\cdot \mid s_t, a_t)$, the environment samples the next state
$s_{t+1} \sim P(\cdot \mid s_t, a_t)$, and the agent receives the reward $r_t$ and moves to the next
state $s_{t+1}$.
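
The interaction loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the environment used in the paper: the two-state MDP in `TRANSITIONS`, the helper names `sample_initial_state`, `step`, `run_episode`, and `random_policy`, and the step limit are all hypothetical.

```python
import random

# Hypothetical toy MDP used only for illustration (not the Ms. PacMan environment).
# TRANSITIONS[state][action] -> list of (probability, next_state, reward, done)
TRANSITIONS = {
    "start": {
        "stay": [(1.0, "start", 0.0, False)],
        "go": [(0.8, "goal", 1.0, True), (0.2, "start", 0.0, False)],
    },
}


def sample_initial_state():
    """Sample s_0 ~ p(s_0); here the initial distribution is a point mass."""
    return "start"


def step(state, action, rng):
    """Sample the reward and the next state given the (state, action) pair."""
    outcomes = TRANSITIONS[state][action]
    weights = [p for p, _, _, _ in outcomes]
    _, next_state, reward, done = rng.choices(outcomes, weights=weights, k=1)[0]
    return next_state, reward, done


def run_episode(policy, rng, max_steps=100):
    """Select a_t, receive r_t, move to s_{t+1}; repeat until done."""
    state = sample_initial_state()
    rewards = []
    for _ in range(max_steps):
        action = policy(state, rng)
        state, reward, done = step(state, action, rng)
        rewards.append(reward)
        if done:
            break
    return rewards


def random_policy(state, rng):
    """Uniformly random choice among the actions available in `state`."""
    return rng.choice(sorted(TRANSITIONS[state]))


rewards = run_episode(random_policy, random.Random(0))
```

The list returned by `run_episode` holds the reward sequence of one episode, which is the quantity the agent's objective is built from.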

Based on this, the agent's policy can be stated as $\pi(s, a)$, which specifies how the agent
chooses an action in each state. The objective of the RL agent is to find the optimal policy $\pi^*$
that maximizes the cumulative discounted reward $\sum_{t \ge 0} \gamma^t r_t$. Because the initial
state, the rewards, and the state transitions are random, this sum is a random variable. Thus, to
handle the randomness, the maximum expected sum of rewards is taken.


$$\pi^* = \arg\max_{\pi} \mathbb{E}\left[\sum_{t \ge 0} \gamma^t r_t \,\middle|\, \pi\right] \tag{2}$$


Here the initial state is sampled from the initial state distribution, $s_0 \sim p(s_0)$, the action is
sampled from the policy given the state, $a_t \sim \pi(\cdot \mid s_t)$, and the next state is sampled
from the transition probability distribution, $s_{t+1} \sim p(\cdot \mid s_t, a_t)$.
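
As a small worked example of the quantity inside the expectation in (2), the sketch below computes the discounted sum of rewards for one episode and averages it over several sampled episodes. The function names and the sample reward lists are illustrative assumptions.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for one sampled episode."""
    total = 0.0
    # Accumulate backwards so each step computes G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def expected_return(episodes, gamma):
    """Monte Carlo estimate of the expectation in (2): the mean over episodes."""
    return sum(discounted_return(ep, gamma) for ep in episodes) / len(episodes)


# With gamma = 0.5: 1 + 0.5 * 1 + 0.25 * 1 = 1.75
example = discounted_return([1, 1, 1], 0.5)
```

A smaller `gamma` shrinks the contribution of later rewards, which is why the same reward sequence yields a lower return as the discount factor decreases.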


2.3. Value function and Q-value function 
Finding the optimal policy means that the agent has to learn the goodness of a state and the
goodness of a state-action pair. The value function is the expected cumulative reward from following
the policy from a given state; it quantifies how good or bad a state is.


$$V^{\pi}(s) = \mathbb{E}\left[\sum_{t \ge 0} \gamma^t r_t \,\middle|\, s_0 = s, \pi\right] \tag{3}$$


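The Q-value function at state $s$ and action $a$ is the expected cumulative reward from taking action
$a$ in state $s$ and then following the policy.


$$Q^{\pi}(s, a) = \mathbb{E}\left[\sum_{t \ge 0} \gamma^t r_t \,\middle|\, s_0 = s, a_0 = a, \pi\right] \tag{4}$$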
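
The two definitions in (3) and (4) are linked by a standard identity, given here as a reading aid rather than as part of the original derivation: averaging the Q-value over the actions chosen by the policy in state $s$ gives the value of that state.

```latex
V^{\pi}(s) = \sum_{a \in A} \pi(a \mid s)\, Q^{\pi}(s, a)
```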
2.4. Q-learning
Q-learning uses an action-value function, $Q$, to approximate the optimal action-value function,
$Q^*$. Q-learning uses the Bellman equation, which is mainly used to solve optimization problems in
dynamic programming. The form of the equation used for RL is shown in (5).


$$Q(s, a) = r(s, a) + \gamma \max_{a'} Q(s', a') \tag{5}$$


Equation (5) is a form of the Bellman equation written for a state-action pair. $Q(s, a)$,
commonly referred to as the Q value, is the sum of the immediate reward $r(s, a)$ and the highest
Q value obtainable from the next state $s'$, reached by taking action $a$, multiplied by the discount
factor $\gamma$. The discount factor, a number between 0 and 1, controls the relative importance of
short-term and long-term rewards. With a discount factor close to 0, the agent favors short-term rewards:
it is compelled to act greedily and grab the largest reward as soon as possible, which leads to
exploitation behavior. A discount factor close to 1 has the opposite effect and gives weight to
long-term rewards. The goal of Q-learning is thus to maximize the achievable future cumulative reward.
Because the update always takes the maximum over the next actions, Q-learning is called a greedy algorithm.
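
The update built on (5) can be illustrated with a tabular sketch. The step size `alpha`, the dictionary-based table, and the state and action names are assumptions made for the example; (5) itself only defines the target value.

```python
from collections import defaultdict


def q_learning_update(q_table, state, action, reward, next_state, next_actions,
                      alpha=0.1, gamma=0.99, done=False):
    """Move Q(s, a) toward r + gamma * max_a' Q(s', a'), the right side of (5)."""
    if done or not next_actions:
        target = reward  # no future reward after a terminal state
    else:
        best_next = max(q_table[(next_state, a)] for a in next_actions)
        target = reward + gamma * best_next
    # alpha is an assumed step size controlling how far the estimate moves.
    q_table[(state, action)] += alpha * (target - q_table[(state, action)])
    return q_table[(state, action)]


q = defaultdict(float)  # unseen (state, action) pairs start at 0
q[("s1", "left")] = 2.0
q[("s1", "right")] = 5.0

# Target: 1.0 + 0.9 * max(2.0, 5.0) = 5.5; new value: 0 + 0.5 * (5.5 - 0) = 2.75
new_value = q_learning_update(q, "s0", "right", reward=1.0, next_state="s1",
                              next_actions=["left", "right"], alpha=0.5, gamma=0.9)
```

The `max` over the next actions is the greedy step: the update assumes the best available action will be taken from the next state, regardless of which action the agent actually takes there.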

The Q-learning algorithm is a reliable solution in the field of RL, especially in unknown
environments. Known for its ease of use, Q-learning has few parameters, strong exploratory capability,
and a convergence guarantee. One of its distinctive characteristics is that it does not require an
explicit model of the environment. This makes Q-learning especially useful in situations where it is
difficult or impractical to obtain an accurate model [25]. Because of its effectiveness in path
planning, an area where outcomes are optimized, Q-learning has attracted considerable attention in
academic study. Its capacity to navigate unfamiliar environments and determine optimal policies without
a pre-existing model highlights its adaptability and usefulness in many real-world circumstances.


2.5. Deep Q-network 

A deep Q network (DQN) combines a deep neural network with the Q-learning mechanics. The neural
network replaces the table of Q values and acts as a function approximator for them. By minimizing a
cost function similar to the mean squared error, the algorithm reduces the difference between the
current Q value estimate and its target, so that the Q value moves toward its final converged value.
The cost function of the deep Q network is defined in (6):


$$\text{Cost} = \left[Q(s, a; \theta) - \left(r(s, a) + \gamma \max_{a'} Q(s', a'; \theta)\right)\right]^2 \tag{6}$$


where $Q(s, a; \theta)$ is the new state-action value function, which is parameterized by the trainable
weights $\theta$ of the neural network.
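
To make the cost in (6) concrete, the sketch below evaluates it on a hand-written minibatch. Plain numbers stand in for the network outputs, and averaging over the minibatch and dropping the future term for terminal transitions are assumptions of this example, not details stated for (6).

```python
def td_target(reward, next_q_values, gamma, done):
    """r + gamma * max_a' Q(s', a'; theta), or just r for a terminal transition."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


def dqn_cost(batch, gamma=0.99):
    """Mean of (Q(s, a; theta) - target)^2 over a minibatch.

    Each item is (q_sa, reward, next_q_values, done), where q_sa and
    next_q_values stand in for outputs of the neural network.
    """
    errors = [
        (q_sa - td_target(reward, next_q, gamma, done)) ** 2
        for q_sa, reward, next_q, done in batch
    ]
    return sum(errors) / len(errors)


batch = [
    (1.0, 0.0, [0.5, 2.0], False),  # target = 0 + 0.5 * 2.0 = 1.0, squared error 0
    (3.0, 1.0, [9.0, 9.0], True),   # terminal: target = 1.0, squared error 4
]
cost = dqn_cost(batch, gamma=0.5)   # (0 + 4) / 2 = 2.0
```

In a full implementation the gradient of this cost with respect to the weights would drive the update; the sketch only shows how the target and the error are formed.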

A main problem in training the agent is making its learning independent of the order of state
transitions. An agent that learns only from the immediately preceding state risks being trapped in an
unwanted scenario, as it has no other source to learn from. Hence, experience replay is introduced:
past transitions are stored and sampled at random for training. The algorithm is shown in Algorithm 1.


Algorithm 1. Deep Q-learning with experience replay
Initialize replay memory $D$ to capacity $N$
Initialize action-value function $Q$ with random weights $\theta$
for episode = 1, $M$ do
    Initialize sequence $s_1 = \{x_1\}$ and preprocessed sequence $\phi_1 = \phi(s_1)$
    for $t$ = 1, $T$ do
        With probability $\epsilon$ select a random action $a_t$
        otherwise select $a_t = \arg\max_a Q(\phi(s_t), a; \theta)$
        Execute action $a_t$ in the emulator and observe reward $r_t$ and image $x_{t+1}$
        Set $s_{t+1} = s_t, a_t, x_{t+1}$ and preprocess $\phi_{t+1} = \phi(s_{t+1})$
        Store transition $(\phi_t, a_t, r_t, \phi_{t+1})$ in $D$
        Sample random minibatch of transitions $(\phi_j, a_j, r_j, \phi_{j+1})$ from $D$
        Set $y_j = r_j$ for terminal $\phi_{j+1}$, and
            $y_j = r_j + \gamma \max_{a'} Q(\phi_{j+1}, a'; \theta)$ for non-terminal $\phi_{j+1}$
        Perform a gradient descent step on $(y_j - Q(\phi_j, a_j; \theta))^2$ according to equation (6)
    end for
end for
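
Two building blocks of Algorithm 1, the replay memory and the epsilon-greedy action choice, can be sketched with the Python standard library. The class and function names, the extra `done` flag stored with each transition, and uniform sampling without replacement are assumptions of this sketch rather than details taken from the author's code.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-capacity buffer D; the oldest transitions are dropped first."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def store(self, phi, action, reward, next_phi, done):
        self.buffer.append((phi, action, reward, next_phi, done))

    def sample(self, batch_size, rng=random):
        # Uniform sampling breaks up the order of consecutive transitions.
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action, otherwise the argmax."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


memory = ReplayMemory(capacity=3)
for t in range(5):
    memory.store(phi=t, action=0, reward=1.0, next_phi=t + 1, done=False)
minibatch = memory.sample(2, random.Random(0))
greedy_action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0)
```

Because the buffer has a fixed capacity, only the three most recent transitions remain after five insertions, and every sampled minibatch is drawn from those.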
3. RESULTS AND DISCUSSION

Ms. PacMan's environment satisfies the Markov property, as the agent does not need to know the
previous states to predict the next state. For example, the agent does not need to know how the bonus
fruit came to appear in the game; it only needs to learn to approach the bonus fruit when it appears on
the game screen. The system processes high-dimensional pixel data. Therefore, each game frame is
preprocessed by a user-defined function to a smaller size (84, 84) in grayscale. Figure 1 illustrates
the transformation of the game frame.


[Figure 1 panels: real game frame (210, 160); preprocessed frame (84, 84)]


Figure 1. The transformation of the preprocessed game frame 
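
The frame transformation in Figure 1 can be sketched in pure Python. The paper only says that a user-defined function produces an (84, 84) grayscale frame; the luminance weights and the nearest-neighbour downsampling used here are assumptions, as are the function names.

```python
def to_grayscale(frame):
    """Convert an H x W grid of (r, g, b) tuples to H x W luminance values."""
    return [
        [0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
        for row in frame
    ]


def resize_nearest(gray, out_h=84, out_w=84):
    """Downsample by nearest-neighbour lookup (an assumed, simple choice)."""
    in_h, in_w = len(gray), len(gray[0])
    return [
        [gray[(i * in_h) // out_h][(j * in_w) // out_w] for j in range(out_w)]
        for i in range(out_h)
    ]


def preprocess(frame):
    """(210, 160) RGB frame -> (84, 84) grayscale frame."""
    return resize_nearest(to_grayscale(frame))


# Synthetic all-white frame with the dimensions given in Figure 1.
frame = [[(255, 255, 255)] * 160 for _ in range(210)]
small = preprocess(frame)
```

The preprocessed frame has 84 x 84 = 7,056 values instead of 210 x 160 x 3 = 100,800, which is the reduction that makes the pixel input manageable for the network.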


The agent's performance in the initial episode is shown in Figure 2. In the first episode, the
agent is still exploring the surrounding area (Figure 2(a)). The agent's movement is not smooth, which
means that the agent does not take one action per direction it is heading in (Figure 2(b)). Instead,
the agent tries different random actions that cause it to stutter around and end up dying, having
gathered a score of 90 (Figure 2(c)).




Figure 2. The agent's performance at the initial episode, (a) the interface of the agent’s initial behavior, (b) 
the agent’s collision with a ghost, and (c) the agent out of lives 


The states used in this paper are the preprocessed game frames. The state space is large, as
Ms. PacMan has 1293 distinct locations in the maze. A complete state of the Ms. PacMan model consists
of the locations of Ms. PacMan, the ghosts, and the power pills, along with each ghost's previous move
and whether the ghost is edible.

The agent can perform nine actions in the game, each represented by a single integer. These
actions are ['NOOP', 'UP', 'RIGHT', 'LEFT', 'DOWN', 'UPRIGHT', 'UPLEFT', 'DOWNRIGHT', 'DOWNLEFT'].
Meanwhile, Ms. PacMan's rewards are obtained by gathering food (Pac-Dot), bonus fruit (Fruit), and
power-up items (Power Pellet), by eating a ghost, and by chain-eating ghosts. The rewards are listed
in Tables 1 and 2.
Table 1. Ms. PacMan's reward space: the foods and ghost-eating scores
Name            Score
Pac-Dot         10 points
Power Pellet    50 points
1 Ghost         200 points
2 Ghost         400 points
3 Ghost         600 points
4 Ghost         800 points


Table 2. Ms. PacMan's reward space: the bonus fruits
Name            Score
Cherry          100 points
Strawberry      200 points
Orange          500 points
Pretzel         700 points
Apple           1000 points
Pear            2000 points
Banana          5000 points
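
The action encoding and the reward values in Tables 1 and 2 can be written as simple lookup tables. This is an illustrative sketch: it assumes that the integer for each action is its position in the action list, and the helper `total_score` and the event names are hypothetical.

```python
# Assumed mapping: the integer action is the index into this list.
ACTIONS = ["NOOP", "UP", "RIGHT", "LEFT", "DOWN",
           "UPRIGHT", "UPLEFT", "DOWNRIGHT", "DOWNLEFT"]

# Values copied from Table 1.
FOOD_SCORES = {"Pac-Dot": 10, "Power Pellet": 50}
GHOST_SCORES = {"1 Ghost": 200, "2 Ghost": 400, "3 Ghost": 600, "4 Ghost": 800}

# Values copied from Table 2.
FRUIT_SCORES = {"Cherry": 100, "Strawberry": 200, "Orange": 500, "Pretzel": 700,
                "Apple": 1000, "Pear": 2000, "Banana": 5000}


def total_score(events):
    """Sum the scores of a list of event names, e.g. ["Pac-Dot", "Cherry"]."""
    table = {**FOOD_SCORES, **GHOST_SCORES, **FRUIT_SCORES}
    return sum(table[event] for event in events)


nine_dots = total_score(["Pac-Dot"] * 9)                       # 9 * 10 = 90
mixed = total_score(["Power Pellet", "1 Ghost", "Cherry"])     # 50 + 200 + 100 = 350
```

Since a single Pac-Dot is worth 10 points, episode scores in the low hundreds correspond to only a few dozen pellets collected.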


Testing was conducted on a local machine in the PyCharm IDE. The agent was observed by inspecting
the output of the neural network per game episode, analyzing the video output, and plotting a gameplay
graph. In the first episode, the agent is still exploring the surrounding area. Its movement is not
smooth, which means that it does not take one action per direction it is heading in. Instead, the agent
tries different random actions that cause it to stutter around the hall and end up dying. By the end of
its life, the agent has gathered a score of 220 without using any power pills, which power up
Ms. PacMan to eat ghosts without dying.

In the default setting for the algorithm, training uses 1 epoch and a learning rate of 0.0025.
The number of training iterations is set to one because the training function is called repeatedly,
once every 4 timesteps. The author also experimented with two other scenarios for training the agent
and compared the results. The full set of training scenarios is shown in Table 3.


Table 3. Testing scenario
           Epoch    Learning rate
Case A     1        0.3
Case B     100      0.025
Default    1        0.025
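
The scenarios in Table 3 and the rule of calling the training function every 4 timesteps can be expressed as a small configuration sketch. The class name, the helper functions, and the assumption that one call runs `epochs` passes are illustrative; the values are copied from Table 3.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingCase:
    name: str
    epochs: int
    learning_rate: float


# Values copied from Table 3.
CASES = [
    TrainingCase("Case A", 1, 0.3),
    TrainingCase("Case B", 100, 0.025),
    TrainingCase("Default", 1, 0.025),
]

TRAIN_EVERY = 4  # the training function is called once every 4 timesteps


def training_calls(num_timesteps, train_every=TRAIN_EVERY):
    """Number of times the training function is called in `num_timesteps` steps."""
    return num_timesteps // train_every


def training_passes(num_timesteps, case, train_every=TRAIN_EVERY):
    """Assumes each call runs `case.epochs` passes, so work scales with epochs."""
    return training_calls(num_timesteps, train_every) * case.epochs


calls = training_calls(400)                 # 400 // 4 = 100
case_b_passes = training_passes(400, CASES[1])  # 100 calls * 100 epochs = 10000
```

Under this reading, raising the epoch count from 1 to 100 multiplies the training work per timestep by 100, which is consistent with case B being far more expensive to run than the other two scenarios.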


The first scenario, case A, uses 1 epoch and a learning rate of 0.3. Training 100 episodes with
these settings took 6 hours. By its 100th game, the agent had undergone 12,595 training sessions. The
agent scored 180 in the 100th episode. The scores from episode 80 to episode 110 mostly fell in the
range of 200-250; the highest score, 590, occurred at episode 97. In this case, the agent still tends
to behave as it did in the initial test.

The second scenario, case B, uses 100 epochs and a learning rate of 0.025 to test whether the
agent performs better when the amount of training is increased. Training 100 episodes took 1 day and
2 hours, or 26 hours in total. By its 100th game, the agent had undergone $13{,}023 \times 10^2$
training sessions. In this scenario, the agent explored more consistently and detected that much of the
path had already been cleared of the food that previously lay there. The agent achieved three high
scores, at episodes 85, 100, and 104, with values of 940, 1,040, and 1,340, respectively. Given the
amount of resources consumed in this scenario, the author decided that it was not feasible to pursue
further.

For the default case, the training was conducted in three phases because a technical error
interrupted the training session. The first phase has 547 episodes and ran for 29 hours and 27 minutes.
The second phase has 601 episodes and ran for 9 hours and 43 minutes. The third phase has 1,001
episodes and ran for 20 hours and 8 minutes. Hence, in total, the default scenario went through 2,149
episodes in 59 hours and 19 minutes. In this scenario, the agent can reach 100 episodes/hour. However,
as the training went on, the training speed declined to around 35-40 episodes/hour from the 700th to
the 1,001st episode of the second phase. The training speed declined because the agent consumed a lot
of memory and storage throughout the training session. At the start of a training session, the agent
needs less than 2 GB of RAM, whereas halfway through the training, around the 600th episode, it
consumes around 6.5-8.7 GB of RAM.

In the first phase, the agent explored the possible actions it could take, as shown by the low
scores achieved (around 200-400 points) and the stuttering behavior. Thus, the high scores achieved at
this point cannot be attributed to intelligent decisions. The average scores graph, depicted in
Figure 3, shows the trend of improvement in the agent's performance over the 547 episodes.
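
An average score curve of this kind can be produced with a running mean over recent episodes. The paper does not state how its averages were computed, so the trailing window and its size in this sketch are assumptions, as are the function name and the sample scores.

```python
def moving_average(scores, window=100):
    """Average of the last `window` scores at each episode (fewer at the start)."""
    averages = []
    running = 0.0
    for i, score in enumerate(scores):
        running += score
        if i >= window:
            running -= scores[i - window]  # drop the score that left the window
        averages.append(running / min(i + 1, window))
    return averages


# Hypothetical scores for three episodes, averaged over a window of two.
smoothed = moving_average([200, 400, 600], window=2)  # [200.0, 300.0, 500.0]
```

Smoothing in this way makes a trend visible even when individual episode scores fluctuate widely, as they do while the agent is still exploring.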

In the second phase, the agent started to frequently achieve scores of more than 400 points, as
shown in the score history graph in Figure 4. The high scores also improved, with the top of the score
graph rising to more than 2,500 points. The epsilon is set to continue from the progress made in the
first phase, with 0.05 added to its value.

In the third phase of the training session, the agent is likewise loaded with the weights from the
second phase and the latest epsilon of the second phase. The agent's performance at the beginning is
much better than at the beginning of the first phase. By the end of the third phase, the agent had
reached a maximum score of 4,020 points at the 462nd episode, as shown in Figure 5. Around the 700th
episode, the maximum average score peaked at 877.7. The second evaluation agent gained 651.8
evaluation points.

The agent's performance in the default case shows promising behavior, indicated by its
improvement in achieving new higher scores. The average score history graph in Figure 6 gives more
detail on the agent's performance. The figure shows that the agent improved steadily throughout
learning. However, after the 500th episode, the agent's performance stagnates and comes close to
deteriorating.

An additional experiment was carried out on a simpler game environment, Breakout. Breakout is a
game in which a layer of bricks occupies the top third of the screen, and the goal is to destroy all of
the bricks using a bouncing ball that the player hits with a paddle that moves horizontally. The agent
shows similar performance in Breakout, as shown by the average score graph in Figure 7. After 1,000
episodes, the agent is still in the exploration phase, as shown by the fluctuation of the average
score graph.

In this research, the author implements the algorithm featured in [9]; the results reported in
that paper are shown in Figure 8. However, because the parameters used here are a combination of
experimental and recommended values, the performance of the code may not be optimal. Many sources offer
better approaches to constructing the algorithm or setting the recommended parameters, but these
approaches require more expert knowledge and considerable time to study.


Figure 3. The first phase score graph

Figure 4. The second phase score graph
Figure 8. The average reward per episode on Breakout and Seaquest, respectively, during training, as
reported in [9]


4. CONCLUSION 

Even though the agent's training progress shows a good trend of improvement, this research cannot
be said to have succeeded in creating an agent that masters and exploits Ms. PacMan. The author found
that the action selection mechanism works as intended: the agent first explores new possibilities and
then exploits a good approach to get a better result. The agent's score stagnated after the 500th
episode, followed by a decline in the latest episodes. This pattern is assumed to occur because the
agent exploits actions that do not lead to the desired result; its performance and behavior around
1,000 episodes are not yet perfect, which means the agent is still learning. Hence, an agent that
exploits undesired actions suffers a drop in performance.


Int J Adv Appl Sci, ISSN: 2252-8814, Vol. 12, No. 4, December 2023: 396-404


At the end of the experiment, the agent can move with less stuttering, explore a large area, and
use some power pills before it runs out of lives. Given the results of the experiment, the author
believes that RL could be a great approach to solving real-world problems. The author encourages other
students to study this field. It should be noted that further knowledge is required, as real-world
problems are more complicated than a game, which means that a basic RL algorithm provides only a
foundation for solving them.